Data Preparation and Processing
The analysis began with data preparation. Because official crime rate statistics are not published at the LSOA (Lower Layer Super Output Area) level, we calculated them ourselves, combining LSOA-level usual resident population counts from the 2021 Census with Metropolitan Police Service crime records for 2021. The rate is expressed as recorded crimes per 1,000 usual residents; for example, 94 crimes among 1,845 residents gives a rate of 50.9.
## # A tibble: 6 × 6
## `LSOA Code` `LSOA Name` Borough Total_Crime_Count Population Crime_Rate
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 E01000006 Barking and Dagen… E09000… 94 1845 50.9
## 2 E01000007 Barking and Dagen… E09000… 507 2908 174.
## 3 E01000008 Barking and Dagen… E09000… 224 1795 125.
## 4 E01000009 Barking and Dagen… E09000… 298 1804 165.
## 5 E01000011 Barking and Dagen… E09000… 111 1701 65.3
## 6 E01000012 Barking and Dagen… E09000… 142 2347 60.5
##
## Success! The result file 'LSOA_Crime_Rate_2021_With_Names.csv' now includes LSOA names.
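A minimal sketch of the rate calculation, using the first three rows of the table above. The data frame name `lsoa` and its column names are illustrative, and the per-1,000 definition is an assumption that happens to reproduce the printed rates:

```r
# Illustrative data: first three rows of the output table above.
lsoa <- data.frame(
  lsoa_code   = c("E01000006", "E01000007", "E01000008"),
  crime_count = c(94, 507, 224),
  population  = c(1845, 2908, 1795)
)

# Crimes per 1,000 usual residents (assumed definition).
lsoa$crime_rate <- lsoa$crime_count / lsoa$population * 1000

print(round(lsoa$crime_rate, 1))
```

Dividing by population before scaling keeps the rate comparable across LSOAs of different sizes.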
Data Processing and Integration
Following the initial data preparation, the next stage involved processing and merging the analytical dataset. With the exception of crime rates, all primary variables in this study were sourced from the 2021 Census. We chose this dataset for its completeness and consistent formatting, which reduces the risk of mismatches when variables are merged.
We then selected and renamed the variables and merged them into a single working dataset. Based on prior research and our preliminary analysis, logarithmic transformations were applied at this stage to selected variables (crime rate, job density, and population density) to address skewness.
##
## Merge Successful!
## Total rows: 4988
## Total variables: 38
## Confirmed inclusion of variable: pct_level4_qual (Higher Education)
## File saved as: London_LSOA_Final_Model_Data_v3.csv
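The merge and transformation step can be sketched as follows. The two input tables, their values, and the key name `lsoa_code` are hypothetical; the original script's actual inputs are not shown in this report:

```r
# Hypothetical inputs: one table per source, keyed by LSOA code.
crime  <- data.frame(lsoa_code = c("A", "B", "C"),
                     crime_rate = c(50.9, 174.3, 124.8))
census <- data.frame(lsoa_code = c("A", "B", "C"),
                     pop_density = c(4200, 9800, 15500))

# Inner join on the shared key; rows missing from either table are dropped.
model_data <- merge(crime, census, by = "lsoa_code")

# Natural-log transforms of the right-skewed variables.
# Assumes strictly positive values; zeros would need separate handling.
model_data$log_crime_rate  <- log(model_data$crime_rate)
model_data$log_pop_density <- log(model_data$pop_density)

print(model_data)
```

Comparing `nrow(model_data)` with the row counts of the inputs is a quick way to confirm that no LSOAs were lost in the join.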
Descriptive Analysis
Subsequently, we conducted a descriptive statistical analysis of crime rate data across London boroughs. The results clearly distinguish between high-risk and low-risk areas within the city, revealing significant spatial disparities.
## [1] "Statistics table saved: LSOA_Descriptive_Statistics_Report.csv"
## [1] "Chart_3_Correlation_Matrix.png"
##
## All charts generated! Please check the PNG images in the folder.
Exploratory Data Analysis
This section begins the exploratory analysis. Normality tests rejected normality for every primary variable (all p < 0.001), which is expected with nearly 5,000 observations, so we used skewness to judge how severe each departure was. Many demographic variables (e.g., religious composition) were highly skewed. Logarithmic transformations had therefore been applied to key variables in advance, and the subsequent correlation analysis reports both Spearman and Pearson coefficients to limit the influence of the remaining skewness.
##
## --- LSOA Variable Distribution Assessment Report (Based on Skewness) ---
##    variable             statistic      p.value    skewness  dist_type                 suggestion
##  1 Crime_Rate           0.3685777 3.239666e-85 13.23432014  Highly Skewed             Suggest Log Transform
##  2 pct_jewish           0.2828450 4.427402e-88  7.01130205  Highly Skewed             Suggest Log Transform
##  3 pct_sikh             0.3514268 8.157895e-86  6.03373874  Highly Skewed             Suggest Log Transform
##  4 pct_16_19            0.7187986 3.803169e-68  5.06866237  Highly Skewed             Suggest Log Transform
##  5 Job_Density          0.7569391 2.809444e-65  4.39164476  Highly Skewed             Suggest Log Transform
##  6 pct_buddhist         0.7993706 1.341829e-61  3.28781997  Highly Skewed             Suggest Log Transform
##  7 pct_hindu            0.6503941 1.504138e-72  3.06709844  Highly Skewed             Suggest Log Transform
##  8 Pop_Density          0.8411438 2.862744e-57  2.74751781  Highly Skewed             Suggest Log Transform
##  9 pct_20_24            0.7983395 1.073557e-61  2.64266145  Highly Skewed             Suggest Log Transform
## 10 pct_new_migrant      0.8496540 2.845226e-56  1.80930161  Highly Skewed             Suggest Log Transform
## 11 pct_muslim           0.8470371 1.388409e-56  1.63254118  Highly Skewed             Suggest Log Transform
## 12 pct_born_asia        0.8495826 2.789701e-56  1.61000431  Highly Skewed             Suggest Log Transform
## 13 pct_born_africa      0.9017181 7.084446e-49  1.39546005  Highly Skewed             Suggest Log Transform
## 14 pct_born_americas    0.9098315 1.922297e-47  1.26977301  Highly Skewed             Suggest Log Transform
## 15 log_pop_density      0.9490044 1.622746e-38 -1.04887880  Highly Skewed             Suggest Log Transform
## 16 log_job_density      0.9565763 3.573233e-36 -0.92806515  Slight Skew (Good)        Keep Original
## 17 log_crime_rate       0.9642122 1.863186e-33  0.83580658  Slight Skew (Good)        Keep Original
## 18 pct_male             0.9535228 3.731902e-37  0.74908694  Slight Skew (Good)        Keep Original
## 19 pct_bad_health       0.9722962 4.973730e-30  0.72502361  Slight Skew (Good)        Keep Original
## 20 pct_overcrowded      0.9502477 3.758313e-38  0.71848301  Slight Skew (Good)        Keep Original
## 21 pct_disabled         0.9739052 2.946788e-29  0.70969342  Slight Skew (Good)        Keep Original
## 22 pct_elementary_occup 0.9556265 1.746430e-36  0.69761753  Slight Skew (Good)        Keep Original
## 23 pct_born_europe      0.9784964 7.864644e-27 -0.61895647  Slight Skew (Good)        Keep Original
## 24 pct_christian        0.9742063 4.149803e-29 -0.59418139  Slight Skew (Good)        Keep Original
## 25 pct_unemployed       0.9786538 9.674061e-27  0.58216061  Slight Skew (Good)        Keep Original
## 26 pct_private_rented   0.9797458 4.204051e-26  0.47894220  Approx Normal (Excellent) Keep Original
## 27 pct_level4_qual      0.9645341 2.480384e-33  0.39516542  Approx Normal (Excellent) Keep Original
## 28 pct_no_qual          0.9891000 3.436640e-19  0.19061416  Approx Normal (Excellent) Keep Original
## 29 pct_deprived         0.9890197 2.890549e-19  0.16480399  Approx Normal (Excellent) Keep Original
## 30 pct_no_religion      0.9869718 4.600704e-21 -0.10708486  Approx Normal (Excellent) Keep Original
## 31 pct_hh_with_disabled 0.9963926 1.145266e-09 -0.08995187  Approx Normal (Excellent) Keep Original
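The labelling in the report above can be sketched as follows. The cut-offs of 0.5 and 1 are inferred from the labels in the printed table rather than taken from the original script, the moment-based skewness formula is one common definition among several, and Shapiro-Wilk is shown only as one possible normality test:

```r
# Sample skewness as the third standardised moment (one common definition).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Cut-offs of 0.5 and 1 are inferred from the labels in the report above.
classify_skew <- function(s) {
  if (abs(s) >= 1) {
    "Highly Skewed"
  } else if (abs(s) >= 0.5) {
    "Slight Skew"
  } else {
    "Approx Normal"
  }
}

set.seed(1)
x_raw <- rlnorm(1000)  # simulated right-skewed variable
x_log <- log(x_raw)    # its log is normally distributed by construction

print(c(raw = classify_skew(skewness(x_raw)),
        log = classify_skew(skewness(x_log))))

# Shapiro-Wilk accepts between 3 and 5000 observations.
print(shapiro.test(x_log)$p.value)
```

Because the classification uses the absolute value, a variable with skewness just below -1 (such as `log_pop_density` above) is still flagged as highly skewed even though a further log transform would not help.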
Correlation Analysis
Following the logarithmic transformation of key variables, we utilized both Spearman and Pearson correlation tests to investigate the primary research question regarding the relationship between socio-economic variables and crime rates. The results were visualized using a lollipop chart to provide a clear comparison of effect sizes. The interpretation of these findings primarily relies on the Spearman rank correlation method, given its robustness against non-normal data distributions compared to Pearson’s method.
## List of independent variables for correlation analysis:
## [1] "log_job_density" "pct_private_rented" "pct_overcrowded"
## [4] "pct_new_migrant" "pct_christian" "pct_muslim"
## [7] "pct_hindu" "pct_jewish" "pct_sikh"
## [10] "pct_buddhist" "pct_no_religion" "pct_male"
## [13] "pct_unemployed" "pct_disabled" "pct_bad_health"
## [16] "pct_no_qual" "pct_level4_qual" "pct_16_19"
## [19] "pct_20_24" "pct_deprived" "pct_born_europe"
## [22] "pct_born_africa" "pct_born_asia" "pct_born_americas"
## [25] "pct_elementary_occup" "pct_hh_with_disabled" "log_pop_density"
##
## --- Correlation Analysis Results: Variables vs [Log Crime Rate] ---
##          Variable             Pearson_r Pearson_p Spearman_rho Spearman_p Significant
## cor...1  pct_unemployed           0.393    <2e-16        0.471     <2e-16 YES
## cor...2  pct_overcrowded          0.305    <2e-16        0.401     <2e-16 YES
## cor...3  pct_20_24                0.389    <2e-16        0.401     <2e-16 YES
## cor...4  pct_private_rented       0.411    <2e-16        0.379     <2e-16 YES
## cor...5  pct_born_americas        0.346    <2e-16        0.374     <2e-16 YES
## cor...6  pct_new_migrant          0.393    <2e-16        0.373     <2e-16 YES
## cor...7  pct_deprived             0.291    <2e-16        0.368     <2e-16 YES
## cor...8  pct_bad_health           0.257    <2e-16        0.333     <2e-16 YES
## cor...9  pct_born_africa          0.221    <2e-16        0.312     <2e-16 YES
## cor...10 pct_born_europe         -0.281    <2e-16       -0.302     <2e-16 YES
## cor...11 pct_muslim               0.201    <2e-16        0.295     <2e-16 YES
## cor...12 pct_elementary_occup     0.215    <2e-16        0.277     <2e-16 YES
## cor...13 pct_disabled             0.194    <2e-16        0.262     <2e-16 YES
## cor...14 pct_hindu               -0.185    <2e-16       -0.255     <2e-16 YES
## cor...15 log_pop_density          0.122    <2e-16        0.214     <2e-16 YES
## cor...16 log_job_density          0.118    <2e-16        0.199     <2e-16 YES
## cor...17 pct_christian           -0.153    <2e-16       -0.164     <2e-16 YES
## cor...18 pct_no_qual              0.104  1.52e-13        0.161     <2e-16 YES
## cor...19 pct_sikh                -0.053  0.000189       -0.147     <2e-16 YES
## cor...20 pct_buddhist             0.142    <2e-16        0.135     <2e-16 YES
## cor...21 pct_hh_with_disabled     0.023     0.108        0.105   8.85e-14 YES
## cor...22 pct_born_asia            0.088  4.03e-10        0.075   9.47e-08 YES
## cor...23 pct_male                 0.017     0.222       -0.047   0.000862 YES
## cor...24 pct_16_19                0.051  0.000277        0.016      0.269 NO
## cor...25 pct_level4_qual          0.052  0.000241        0.014      0.339 NO
## cor...26 pct_no_religion          0.035    0.0147        0.011      0.452 NO
## cor...27 pct_jewish              -0.060     2e-05        0.008      0.591 NO
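A minimal sketch of computing both coefficients for one variable pair, on simulated data. The `Significant` flag here uses the Spearman p-value at the 0.05 level, which is an inference from the table above (for example, `pct_16_19` has a significant Pearson p-value but is flagged `NO`), not a rule confirmed by the original script:

```r
set.seed(42)
n <- 500
x <- rlnorm(n)                           # simulated skewed predictor
y <- 0.5 * log(x) + rnorm(n, sd = 0.5)   # monotone but non-linear in x

pearson  <- cor.test(x, y, method = "pearson")
spearman <- cor.test(x, y, method = "spearman", exact = FALSE)

result <- data.frame(
  Pearson_r    = unname(pearson$estimate),
  Pearson_p    = pearson$p.value,
  Spearman_rho = unname(spearman$estimate),
  Spearman_p   = spearman$p.value,
  Significant  = ifelse(spearman$p.value < 0.05, "YES", "NO")
)
print(result)
```

Spearman's rho depends only on ranks, so it is unchanged by any monotone transformation of `x`, including the log. This is why it is the more robust of the two measures for skewed predictors.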
Education Variable Screening
Correlation analysis revealed a surprisingly weak association between educational indicators and crime rates (Spearman rho = 0.161 for pct_no_qual and 0.014 for pct_level4_qual). This finding diverges from prior research and conventional wisdom, which typically posit a strong link between educational attainment and crime. Given this discrepancy, we tested the education variables separately. In a univariate regression the share of residents with Level 4+ qualifications was statistically significant but explained almost none of the variance (R2 = 0.003), and its coefficient lost significance once log job density was controlled for. Consequently, the variable representing higher educational qualifications was excluded from the subsequent modeling process.
##
## LSOA Level Education vs Crime: Masking Effect Analysis
## ==============================================================================================
## Dependent variable:
## --------------------------------------------------------------------------
## Crime Rate (Log)
## Low Edu (Univariate) High Edu (Univariate) High Edu (Controlled)
## (1) (2) (3)
## ----------------------------------------------------------------------------------------------
## Pct No Qual 0.010***
## (0.001)
##
## Pct Level 4+ 0.002*** 0.001
## (0.001) (0.001)
##
## Job Density (Log) 0.085***
## (0.011)
##
## Constant 4.076*** 4.126*** 3.491***
## (0.022) (0.029) (0.088)
##
## ----------------------------------------------------------------------------------------------
## Observations 4,988 4,988 4,988
## R2 0.011 0.003 0.014
## Adjusted R2 0.011 0.003 0.014
## Residual Std. Error 0.582 (df = 4986) 0.584 (df = 4986) 0.581 (df = 4985)
## F Statistic 54.851*** (df = 1; 4986) 13.501*** (df = 1; 4986) 36.170*** (df = 2; 4985)
## ==============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
##
## Analysis complete! Please check the difference between the two plots in 'LSOA_Education_Paradox_Analysis.png'.
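The comparison between the univariate and controlled specifications can be illustrated on simulated data. In this sketch the qualification share is constructed to have no direct effect on crime and to correlate with job density; that data-generating process is an assumption chosen to show the mechanism, not a claim about the London data:

```r
set.seed(7)
n <- 1000
log_job_density <- rnorm(n)
# Simulated assumption: qualifications rise with job density
# and have no direct effect on crime.
pct_level4 <- 40 + 8 * log_job_density + rnorm(n, sd = 5)
log_crime  <- 4 + 0.3 * log_job_density + rnorm(n, sd = 0.5)

m_uni  <- lm(log_crime ~ pct_level4)
m_ctrl <- lm(log_crime ~ pct_level4 + log_job_density)

uni  <- coef(summary(m_uni))["pct_level4", c("Estimate", "Pr(>|t|)")]
ctrl <- coef(summary(m_ctrl))["pct_level4", c("Estimate", "Pr(>|t|)")]
print(rbind(univariate = uni, controlled = ctrl))
```

In the univariate model the education coefficient absorbs the effect of the omitted job density variable; adding the control moves it towards zero.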
Linearity Assessment and Variable Transformation
Following the initial correlation screening, we employed scatter plots to visually assess the linearity and strength of association for the remaining high-potential variables. The majority of these key predictors demonstrated satisfactory linearity, justifying their retention for the exploratory regression phase. However, diagnostic observations revealed that the Youth and Migrant population variables exhibited residual skewness. To address this and improve model fit, we proceeded to conduct a comparative analysis using logarithmic transformations for these specific demographic indicators.
##
## Charts Generated:
## 1. LSOA_Scatter_Strong.png (Strong Correlation)
## 2. LSOA_Scatter_Moderate.png (Moderate/Characteristic)
Logarithmic Transformation Strategy
We compared regressions using the raw and log-transformed versions of these two variables. The log forms reduced skewness (pct_20_24 from 2.643 to 0.786; pct_new_migrant from 1.809 to 0.495) and raised adjusted R-squared from 0.3717 to 0.3849; the sign of the youth coefficient also changed from negative to positive. Because the transformed variables fit the linearity and normality assumptions more closely, logarithmic transformations were formally applied to them for the subsequent regression analysis.
##
## --- Skewness Improvement Report ---
## 1. Youth Population (pct_20_24):
## Raw Skewness: 2.643 -> Log Skewness: 0.786 (Significant Improvement)
## 2. New Migrants (pct_new_migrant):
## Raw Skewness: 1.809 -> Log Skewness: 0.495 (Near Normal)
##
## Variable Form Performance Comparison: Raw vs Log
## ============================================================
## Dependent variable:
## ----------------------------
## Crime Rate (Log)
## Raw Proportions Log Forms
## (1) (2)
## ------------------------------------------------------------
## Job Density -0.1229*** -0.1418***
## (0.0076) (0.0077)
##
## Unemployment 0.0776*** 0.0661***
## (0.0100) (0.0099)
##
## Deprivation 0.2482*** 0.2274***
## (0.0105) (0.0102)
##
## Youth (Raw) -0.0413***
## (0.0097)
##
## New Migrants (Raw) 0.3709***
## (0.0109)
##
## Youth (Log) 0.0325***
## (0.0087)
##
## New Migrants (Log) 0.3402***
## (0.0095)
##
## Constant 4.2291*** 4.2291***
## (0.0066) (0.0065)
##
## ------------------------------------------------------------
## Observations 4,988 4,988
## R2 0.3723 0.3855
## Adjusted R2 0.3717 0.3849
## Residual Std. Error (df = 4982) 0.4639 0.4590
## F Statistic (df = 5; 4982) 591.0408*** 625.0719***
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
##
## Conclusion: Please compare the Adjusted R-squared of the two models.
##
## Typically, the Log Model (Model 2) exhibits a higher R-squared and more significant t-values, indicating that the logarithmic form better captures the true underlying patterns of the data.
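A small simulation of the raw-versus-log comparison. The variable names and the log-linear data-generating process are assumptions made for illustration, so the log specification is the correct one here by construction:

```r
set.seed(3)
n <- 1000
pct_youth <- rlnorm(n, meanlog = 1.5, sdlog = 1)   # simulated right-skewed share
log_crime <- 4 + 0.4 * log(pct_youth) + rnorm(n, sd = 0.4)

m_raw <- lm(log_crime ~ pct_youth)
m_log <- lm(log_crime ~ log(pct_youth))

adj_r2 <- c(raw = summary(m_raw)$adj.r.squared,
            log = summary(m_log)$adj.r.squared)
print(round(adj_r2, 3))
```

Both models have the same number of parameters, so adjusted R-squared is a fair comparison. On real data the log form is not guaranteed to fit better, which is why the comparison is worth running rather than assuming.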
Finalizing the Analytical Dataset
Following the logarithmic transformations, the working dataset was updated to serve as the final analytical table.
##
## Success! New file generated: 'London_LSOA_Final_Model_Data_v4.csv'
## Newly included variables: log_youth, log_migrant
## Total number of variables: 40
Regression Analysis and Variable Selection Strategy
Following the completion of the exploratory analysis, we progressed to the regression modeling phase. Initially, models were specified based on theoretical frameworks and hypothesized explanatory factors. However, these preliminary specifications yielded suboptimal model fit and limited explanatory power. Consequently, to identify the most robust predictors and address potential redundancy, we employed the LASSO (Least Absolute Shrinkage and Selection Operator) technique for automated variable selection in the subsequent analysis.
## Current Sample Size: N = 4988
## Sample count including Westminster: 0 (These are key high-leverage points)
##
## LSOA Crime Rate Regression Results (Including Westminster)
## ===============================================================================================
## Dependent variable:
## ----------------------------------------------------------------------------
## Log Crime Rate
## Economic +Strain +Opportunity +Demog +Health(Core) Alt.Poverty Full Model
## (1) (2) (3) (4) (5) (6) (7)
## -----------------------------------------------------------------------------------------------
## Unemployment 0.140*** 0.124*** 0.134*** 0.074*** 0.050*** 0.062***
## (0.005) (0.006) (0.006) (0.006) (0.006) (0.006)
##
## Overcrowding 0.027***
## (0.003)
##
## Private Rented 0.005*** 0.019*** 0.022*** 0.022*** 0.017*** 0.009***
## (0.002) (0.002) (0.002) (0.001) (0.002) (0.002)
##
## Job Density (Log) 0.013***
## (0.001)
##
## Pop Density (Log) 1.026*** -0.017 0.416*** 0.437*** 0.293***
## (0.064) (0.068) (0.069) (0.069) (0.068)
##
## Youth (Log) -1.138*** -0.187** -0.643*** -0.656*** -0.526***
## (0.070) (0.071) (0.072) (0.072) (0.071)
##
## New Migrants (Log) 0.203*** 0.159*** 0.165*** 0.191***
## (0.028) (0.027) (0.027) (0.027)
##
## Bad Health 0.604*** 0.659*** 0.731*** 0.376***
## (0.023) (0.022) (0.023) (0.028)
##
## Deprivation 0.100*** 0.067*** 0.122***
## (0.005) (0.007) (0.005)
##
## Constant 3.560*** 3.575*** 5.116*** 3.524*** 3.677*** 3.460*** 3.900***
## (0.023) (0.024) (0.121) (0.118) (0.114) (0.114) (0.112)
##
## -----------------------------------------------------------------------------------------------
## Observations 4,988 4,988 4,988 4,988 4,988 4,988 4,988
## R2 0.154 0.157 0.199 0.352 0.399 0.400 0.431
## Adjusted R2 0.154 0.156 0.199 0.351 0.398 0.400 0.430
## ===============================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
##
## --- Multicollinearity Diagnosis (VIF) ---
## pct_unemployed pct_overcrowded pct_private_rented log_job_density
## 2.389109 3.177242 2.878585 71.601300
## log_pop_density log_youth log_migrant pct_bad_health
## 71.986135 1.790695 3.664994 1.733615
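The VIF values above can be reproduced in principle with base R: each predictor is regressed on all the others, and VIF = 1 / (1 - R2). The helper name `vif_manual` and the simulated predictors are illustrative; the original script may have used a package function instead:

```r
# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    fit <- lm(reformulate(others, response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(11)
n  <- 500
x1 <- rnorm(n)
X  <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # nearly collinear with x1
  x3 = rnorm(n)                  # independent of the others
)

vif_values <- vif_manual(X)
print(round(vif_values, 2))
```

Two predictors that carry almost the same information inflate each other's VIF together, which matches the paired values of about 72 for `log_job_density` and `log_pop_density` above.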
LASSO Selection and Outlier Management
In this section, we applied the LASSO (Least Absolute Shrinkage and Selection Operator) method as a data-driven feature selection step. We also refined the dataset by excluding high-leverage outliers, specifically the Westminster area, which had previously distorted model estimates due to its unique non-residential characteristics; this reduced the sample from 4,988 to 4,865 LSOAs. At the selected penalty (lambda of about 0.0002), LASSO retained all 21 candidate predictors, so it did not reduce the variable set by itself. The VIF diagnostics for the refitted OLS model showed severe multicollinearity (for example, values above 300 for pct_born_europe and pct_born_asia), so we removed the collinear variables and reorganized the remaining predictors into a new combination for the next regression.
## Data cleaning complete.
## Original sample size: 4988 -> Sample size after cleaning: 4865
## Lasso matrix preparation complete. Matrix dimensions: 4865 21
## Best Lambda selected by Lasso: 0.0002038313
##
## --- Variables Selected by Lasso and Their Coefficients ---
## Variable Coef
## 1 log_migrant 1.893884e-01
## 2 log_youth 1.813297e-01
## 3 log_pop_density -1.502182e-01
## 4 log_job_density -1.115536e-01
## 5 pct_bad_health 6.962884e-02
## 6 pct_deprived 2.859714e-02
## 7 pct_born_europe -2.717287e-02
## 8 pct_born_asia -2.651869e-02
## 9 pct_unemployed 2.374417e-02
## 10 pct_disabled 1.499216e-02
## 11 pct_born_africa -1.390243e-02
## 12 pct_hh_with_disabled -1.371695e-02
## 13 pct_private_rented 1.362659e-02
## 14 pct_no_religion 1.196217e-02
## 15 pct_overcrowded 9.273849e-03
## 16 pct_elementary_occup -5.329418e-03
## 17 pct_muslim 4.236854e-03
## 18 pct_no_qual 1.946132e-03
## 19 pct_christian 3.098296e-04
## 20 pct_born_americas -3.012018e-04
## 21 pct_level4_qual -8.121339e-05
##
##
## === Final OLS Regression Results (Using only variables selected by Lasso) ===
##
## Call:
## lm(formula = as.formula(formula_str), data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25508 -0.25566 -0.02484 0.21303 2.43584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3919011 1.1700845 8.027 1.25e-15 ***
## log_migrant 0.1835872 0.0326657 5.620 2.01e-08 ***
## log_youth 0.1752009 0.0279149 6.276 3.77e-10 ***
## log_pop_density -0.1467902 0.0775744 -1.892 0.058517 .
## log_job_density -0.1172906 0.0755337 -1.553 0.120530
## pct_bad_health 0.0684906 0.0103224 6.635 3.60e-11 ***
## pct_deprived 0.0283102 0.0038424 7.368 2.03e-13 ***
## pct_born_europe -0.0529002 0.0111592 -4.740 2.19e-06 ***
## pct_born_asia -0.0519122 0.0110915 -4.680 2.94e-06 ***
## pct_unemployed 0.0252033 0.0063451 3.972 7.23e-05 ***
## pct_disabled 0.0161960 0.0077047 2.102 0.035596 *
## pct_born_africa -0.0390266 0.0111424 -3.503 0.000465 ***
## pct_hh_with_disabled -0.0140946 0.0028949 -4.869 1.16e-06 ***
## pct_private_rented 0.0136975 0.0008717 15.713 < 2e-16 ***
## pct_no_religion 0.0117463 0.0012337 9.522 < 2e-16 ***
## pct_overcrowded 0.0093510 0.0024333 3.843 0.000123 ***
## pct_elementary_occup -0.0055036 0.0030131 -1.827 0.067825 .
## pct_muslim 0.0042815 0.0009974 4.293 1.80e-05 ***
## pct_no_qual 0.0010616 0.0031031 0.342 0.732287
## pct_christian 0.0004385 0.0010183 0.431 0.666767
## pct_born_americas -0.0266085 0.0117499 -2.265 0.023584 *
## pct_level4_qual -0.0013797 0.0014663 -0.941 0.346774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4101 on 4843 degrees of freedom
## Multiple R-squared: 0.4805, Adjusted R-squared: 0.4783
## F-statistic: 213.3 on 21 and 4843 DF, p-value: < 2.2e-16
##
##
## --- Multicollinearity Diagnosis (VIF) ---
## log_migrant log_youth log_pop_density
## 5.593808 2.182870 96.681136
## log_job_density pct_bad_health pct_deprived
## 100.305988 7.657706 12.255740
## pct_born_europe pct_born_asia pct_unemployed
## 362.188016 307.378687 3.144761
## pct_disabled pct_born_africa pct_hh_with_disabled
## 6.861894 53.261983 8.324533
## pct_private_rented pct_no_religion pct_overcrowded
## 3.963041 6.251461 8.232763
## pct_elementary_occup pct_muslim pct_no_qual
## 7.365420 3.966967 11.412632
## pct_christian pct_born_americas pct_level4_qual
## 2.875272 32.865362 11.741762
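To illustrate how the penalty size controls selection, here is a base-R coordinate descent sketch of LASSO on simulated data. This is not the routine used to produce the output above (the report does not show which package was called); `lasso_cd` and `soft_threshold` are illustrative names:

```r
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimises (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1 on standardised X.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X)
  y <- y - mean(y)
  n <- nrow(X)
  beta <- setNames(rep(0, ncol(X)), colnames(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Residual with predictor j's own contribution left out.
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      z <- sum(X[, j] * partial_resid) / n
      beta[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

set.seed(5)
n <- 400
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 1.0 * X[, "x1"] + 0.5 * X[, "x2"] + rnorm(n)  # x3 is pure noise

beta_strong <- lasso_cd(X, y, lambda = 0.2)     # stronger penalty
beta_weak   <- lasso_cd(X, y, lambda = 0.0002)  # penalty near the value reported above
print(round(rbind(strong = beta_strong, weak = beta_weak), 3))
```

With the stronger penalty the noise variable is set exactly to zero, while with a penalty near 0.0002 the estimates are almost identical to ordinary least squares and nothing is dropped. This is consistent with all 21 variables surviving in the output above.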
Model Optimization based on Combined Selection Strategies
Combining the LASSO results with the earlier scatter-plot and correlation screening, we assembled a broader set of 17 predictors for this iteration. The model reached an adjusted R-squared of 0.468, above the theory-driven specifications (at most 0.430), and its VIF values were far lower than those of the LASSO-based OLS model, with only pct_deprived (12.1) exceeding 10. We therefore used this model as the basis for further refinement.
##
## Call:
## lm(formula = formula_top18, data = df_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.31287 -0.26592 -0.02926 0.22153 2.89818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8200878 0.0998342 48.281 < 2e-16 ***
## pct_unemployed 0.0187671 0.0063449 2.958 0.003113 **
## pct_private_rented 0.0139796 0.0008487 16.472 < 2e-16 ***
## pct_20_24 0.0115680 0.0033701 3.432 0.000603 ***
## pct_new_migrant 0.0214706 0.0020588 10.429 < 2e-16 ***
## pct_born_americas 0.0290802 0.0028539 10.190 < 2e-16 ***
## pct_overcrowded 0.0108474 0.0024654 4.400 1.11e-05 ***
## pct_deprived 0.0328997 0.0038943 8.448 < 2e-16 ***
## pct_bad_health 0.0817207 0.0103929 7.863 4.56e-15 ***
## pct_born_africa 0.0009449 0.0021997 0.430 0.667520
## pct_muslim 0.0019756 0.0008933 2.212 0.027038 *
## pct_elementary_occup -0.0027321 0.0028282 -0.966 0.334080
## pct_disabled 0.0109597 0.0077623 1.412 0.158039
## log_job_density -0.2465234 0.0098513 -25.024 < 2e-16 ***
## pct_buddhist 0.0537683 0.0097545 5.512 3.72e-08 ***
## pct_no_qual -0.0084951 0.0024551 -3.460 0.000544 ***
## pct_born_asia -0.0092158 0.0010388 -8.872 < 2e-16 ***
## pct_hh_with_disabled -0.0123193 0.0028644 -4.301 1.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4269 on 4970 degrees of freedom
## Multiple R-squared: 0.4696, Adjusted R-squared: 0.4678
## F-statistic: 258.9 on 17 and 4970 DF, p-value: < 2.2e-16
##
## --- Variance Inflation Factor (VIF) Check ---
## pct_unemployed pct_private_rented pct_20_24
## 2.971488 3.681908 2.606561
## pct_new_migrant pct_born_americas pct_overcrowded
## 5.710225 1.883874 7.968851
## pct_deprived pct_bad_health pct_born_africa
## 12.084619 7.722217 1.934892
## pct_muslim pct_elementary_occup pct_disabled
## 3.030105 6.113533 6.847286
## log_job_density pct_buddhist pct_no_qual
## 1.625039 1.199759 6.787904
## pct_born_asia pct_hh_with_disabled
## 2.527744 8.054382
Spatial Fixed Effects Optimization
Building upon the previous specification, we introduced Borough-level fixed effects to control for unobserved spatial heterogeneity across London's administrative districts. A coefficient plot was generated to visualize these location effects. R-squared rose to 0.497 (adjusted 0.493), and the scaled GVIF values, GVIF^(1/(2*Df)), all stayed below 3.6, although the unscaled GVIF for pct_deprived (12.7) remained high.
## Number of Boroughs: 32
##
## Regression Results with Borough Fixed Effects
## =================================================
## Dependent variable:
## ---------------------------
## log_crime_rate
## -------------------------------------------------
## pct_unemployed 0.017***
## (0.006)
##
## pct_private_rented 0.015***
## (0.001)
##
## pct_20_24 0.008**
## (0.003)
##
## pct_new_migrant 0.021***
## (0.002)
##
## pct_born_americas 0.023***
## (0.004)
##
## pct_overcrowded 0.010***
## (0.003)
##
## pct_deprived 0.030***
## (0.004)
##
## pct_bad_health 0.069***
## (0.010)
##
## pct_born_africa 0.005*
## (0.003)
##
## pct_muslim 0.003**
## (0.001)
##
## pct_elementary_occup 0.002
## (0.003)
##
## pct_disabled 0.007
## (0.008)
##
## log_job_density -0.263***
## (0.010)
##
## pct_buddhist 0.038***
## (0.010)
##
## pct_no_qual -0.008***
## (0.003)
##
## pct_born_asia -0.006***
## (0.001)
##
## pct_hh_with_disabled -0.010***
## (0.003)
##
## Constant 5.023***
## (0.110)
##
## -------------------------------------------------
## Borough fixed effects Yes
## Observations 4,988
## R2 0.497
## Adjusted R2 0.493
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## GVIF Df GVIF^(1/(2*Df))
## pct_unemployed 3.175212 1 1.781913
## pct_private_rented 4.285733 1 2.070201
## pct_20_24 2.844538 1 1.686576
## pct_new_migrant 6.490345 1 2.547616
## pct_born_americas 3.546239 1 1.883146
## pct_overcrowded 8.683568 1 2.946790
## pct_deprived 12.722013 1 3.566793
## pct_bad_health 8.146750 1 2.854251
## pct_born_africa 2.677882 1 1.636423
## pct_muslim 5.101017 1 2.258543
## pct_elementary_occup 7.784596 1 2.790089
## pct_disabled 7.019535 1 2.649441
## log_job_density 1.833992 1 1.354250
## pct_buddhist 1.333016 1 1.154563
## pct_no_qual 8.395485 1 2.897496
## pct_born_asia 4.201771 1 2.049822
## pct_hh_with_disabled 8.801903 1 2.966800
## factor(Derived_Borough) 90.188382 31 1.075312
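A minimal sketch of area fixed effects on simulated data, followed by the arithmetic behind the scaled GVIF column. The four boroughs, their shifts, and the variable names are hypothetical; only the GVIF figures in the last lines are taken from the table above:

```r
set.seed(9)
n_borough <- 4
n <- 400
borough <- factor(sample(paste0("B", seq_len(n_borough)), n, replace = TRUE))
borough_shift <- c(B1 = 0, B2 = 0.5, B3 = -0.3, B4 = 1.0)  # hypothetical area effects
x <- rnorm(n)
y <- 4 + 0.2 * x + borough_shift[as.character(borough)] + rnorm(n, sd = 0.3)

m_pooled <- lm(y ~ x)
m_fe     <- lm(y ~ x + borough)  # one dummy per borough except the reference level

adj_r2 <- c(pooled = summary(m_pooled)$adj.r.squared,
            fixed_effects = summary(m_fe)$adj.r.squared)
print(round(adj_r2, 3))

# Scaling a GVIF by its degrees of freedom, using the borough row reported above.
gvif_borough <- 90.188382
df_borough   <- 31
gvif_scaled  <- gvif_borough^(1 / (2 * df_borough))
print(round(gvif_scaled, 6))
```

A factor with k levels contributes k - 1 dummy variables, which is why 32 boroughs appear with 31 degrees of freedom. The exponent 1 / (2 * Df) puts the multi-parameter term on a scale comparable to the square root of an ordinary VIF.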
Exclusion of Ethnicity and Religion Variables
Building upon the fixed-effects model, we conducted a sensitivity analysis by excluding the variables related to ethnicity (proxied here by country of birth) and religion: pct_born_americas, pct_born_africa, pct_born_asia, pct_muslim, and pct_buddhist. Removing them reduced adjusted R-squared only slightly, from 0.493 to 0.482. Consequently, adhering to the principle of model parsimony, these variables were excluded from the final model specification.
## Number of Boroughs: 32
##
## Refined Model Results (Ethnicity/Religion Excluded)
## =================================================
## Dependent variable:
## ---------------------------
## log_crime_rate
## -------------------------------------------------
## pct_unemployed 0.030***
## (0.006)
##
## pct_private_rented 0.015***
## (0.001)
##
## pct_20_24 0.007*
## (0.003)
##
## pct_new_migrant 0.020***
## (0.002)
##
## pct_overcrowded 0.007***
## (0.002)
##
## pct_deprived 0.033***
## (0.004)
##
## pct_bad_health 0.074***
## (0.010)
##
## pct_elementary_occup 0.006**
## (0.003)
##
## pct_disabled 0.013*
## (0.008)
##
## log_job_density -0.255***
## (0.010)
##
## pct_no_qual -0.012***
## (0.003)
##
## pct_hh_with_disabled -0.014***
## (0.003)
##
## Constant 5.060***
## (0.110)
##
## -------------------------------------------------
## Borough fixed effects Yes
## Observations 4,988
## R2 0.487
## Adjusted R2 0.482
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
##
## --- Variance Inflation Factor (VIF) Check ---
## GVIF Df GVIF^(1/(2*Df))
## pct_unemployed 2.963209 1 1.721397
## pct_private_rented 3.937279 1 1.984258
## pct_20_24 2.786139 1 1.669173
## pct_new_migrant 6.352613 1 2.520439
## pct_overcrowded 6.648495 1 2.578468
## pct_deprived 12.583765 1 3.547360
## pct_bad_health 8.086237 1 2.843631
## pct_elementary_occup 6.930283 1 2.632543
## pct_disabled 6.841864 1 2.615696
## log_job_density 1.822892 1 1.350145
## pct_no_qual 8.099530 1 2.845967
## pct_hh_with_disabled 8.326912 1 2.885639
## factor(Derived_Borough) 9.847679 31 1.037580
# ==============================================================================
# Model 2 correlation matrix check: verifying why Deprivation should be dropped
# ==============================================================================
# 1. Load the required packages
if (!require("ggcorrplot")) install.packages("ggcorrplot")
library(tidyverse)
library(ggcorrplot)
# 2. Prepare the data
# Extract every variable included in Model 2 (including Deprivation)
df_model2_corr <- df_clean %>%
  select(
    `Log Crime Rate` = log_crime_rate,
    # Core structural variables
    `Unemployment` = pct_unemployed,
    `Deprivation (IMD)` = pct_deprived, # the variable of main interest
    `Overcrowding` = pct_overcrowded,
    `Private Rented` = pct_private_rented,
    # Vulnerability and health
    `Bad Health` = pct_bad_health,
    `Disability` = pct_disabled,
    `HH w/ Disabled` = pct_hh_with_disabled,
    # Social and demographic
    `Youth (20-24)` = pct_20_24,
    `New Migrant` = pct_new_migrant,
    `No Quals` = pct_no_qual,
    `Elementary Occup` = pct_elementary_occup,
    # Environment
    `Log Job Density` = log_job_density
  )
# 3. Compute the correlation matrix
corr_matrix_m2 <- cor(df_model2_corr, use = "complete.obs", method = "pearson")
# 4. Draw the heatmap
p_corr_m2 <- ggcorrplot(corr_matrix_m2,
  method = "square", # square tiles
  type = "lower", # show the lower triangle only
  lab = TRUE, # print the coefficients
  lab_size = 2.5, # slightly smaller text because there are many variables
  tl.cex = 10, # axis label size
  colors = c("#2E9FDF", "white", "#E7B800"), # blue-white-yellow palette
  title = "Correlation Matrix: Model 2 (Highlighting Collinearity)",
  ggtheme = theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
)
# 5. Display and save
print(p_corr_m2)
ggsave("Model2_Correlation_Matrix.png", p_corr_m2, width = 10, height = 10, bg = "white")
Further Model Simplification and Multicollinearity Reduction
To further streamline the model, we removed pct_deprived, which had the highest GVIF (12.6) in the previous step. A new, more parsimonious model was then fitted to check the stability and performance of the simplified specification; adjusted R-squared fell only slightly, from 0.482 to 0.475.
## Number of Boroughs: 32
##
## =================================================
## Dependent variable:
## ---------------------------
## log_crime_rate
## -------------------------------------------------
## pct_unemployed 0.045***
## (0.006)
##
## pct_private_rented 0.016***
## (0.001)
##
## pct_20_24 0.005
## (0.003)
##
## pct_new_migrant 0.022***
## (0.002)
##
## pct_overcrowded 0.013***
## (0.002)
##
## pct_bad_health 0.088***
## (0.010)
##
## pct_elementary_occup 0.010***
## (0.003)
##
## pct_disabled 0.014*
## (0.008)
##
## log_job_density -0.250***
## (0.010)
##
## pct_no_qual -0.005**
## (0.003)
##
## pct_hh_with_disabled -0.003
## (0.003)
##
## Constant 4.802***
## (0.106)
##
## -------------------------------------------------
## Borough fixed effects Yes
## Observations 4,988
## R2 0.479
## Adjusted R2 0.475
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
##
## --- Variance Inflation Factor (VIF) Check ---
## GVIF Df GVIF^(1/(2*Df))
## pct_unemployed 2.729548 1 1.652134
## pct_private_rented 3.872885 1 1.967965
## pct_20_24 2.773750 1 1.665458
## pct_new_migrant 6.297779 1 2.509538
## pct_overcrowded 5.896427 1 2.428256
## pct_bad_health 7.896560 1 2.810082
## pct_elementary_occup 6.818661 1 2.611257
## pct_disabled 6.839791 1 2.615299
## log_job_density 1.815474 1 1.347395
## pct_no_qual 7.296181 1 2.701144
## pct_hh_with_disabled 6.534689 1 2.556304
## factor(Derived_Borough) 9.296159 31 1.036616
Variable Reduction for Optimal Specification
We continued the variable screening by dropping pct_elementary_occup and pct_hh_with_disabled. Adjusted R-squared was almost unchanged (0.475 to 0.474), and every remaining GVIF fell below 8.
## Number of Boroughs: 32
##
## =================================================
## Dependent variable:
## ---------------------------
## log_crime_rate
## -------------------------------------------------
## pct_unemployed 0.048***
## (0.006)
##
## pct_private_rented 0.016***
## (0.001)
##
## pct_20_24 0.005
## (0.003)
##
## pct_new_migrant 0.022***
## (0.002)
##
## pct_overcrowded 0.016***
## (0.002)
##
## pct_bad_health 0.085***
## (0.010)
##
## pct_disabled 0.010
## (0.007)
##
## log_job_density -0.250***
## (0.010)
##
## pct_no_qual -0.002
## (0.002)
##
## Constant 4.747***
## (0.096)
##
## -------------------------------------------------
## Borough fixed effects Yes
## Observations 4,988
## R2 0.478
## Adjusted R2 0.474
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## GVIF Df GVIF^(1/(2*Df))
## pct_unemployed 2.626996 1 1.620801
## pct_private_rented 3.387598 1 1.840543
## pct_20_24 2.606428 1 1.614444
## pct_new_migrant 5.982817 1 2.445980
## pct_overcrowded 4.834472 1 2.198743
## pct_bad_health 7.270813 1 2.696445
## pct_disabled 6.450482 1 2.539780
## log_job_density 1.814910 1 1.347186
## pct_no_qual 5.308752 1 2.304073
## factor(Derived_Borough) 6.720201 31 1.031205
Final Model Refinement and Conclusion
Building on the previous iteration, we removed the remaining predictors that showed high multicollinearity or contributed little explanatory power, namely pct_disabled and pct_no_qual. Neither was statistically significant in the previous specification. After their removal, the GVIF of pct_bad_health falls from 7.27 to 1.89, while R2 and adjusted R2 remain at 0.478 and 0.474. The streamlined model therefore offers a better balance between parsimony and explanatory power. This step completes predictor selection and concludes the regression analysis phase.
## Number of Boroughs: 32
##
## =================================================
## Dependent variable:
## ---------------------------
## log_crime_rate
## -------------------------------------------------
## pct_unemployed 0.048***
## (0.006)
##
## pct_private_rented 0.016***
## (0.001)
##
## pct_20_24 0.004
## (0.003)
##
## pct_new_migrant 0.023***
## (0.002)
##
## pct_overcrowded 0.015***
## (0.002)
##
## pct_bad_health 0.094***
## (0.005)
##
## log_job_density -0.250***
## (0.010)
##
## Constant 4.746***
## (0.091)
##
## -------------------------------------------------
## Borough fixed effects Yes
## Observations 4,988
## R2 0.478
## Adjusted R2 0.474
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## GVIF Df GVIF^(1/(2*Df))
## pct_unemployed 2.614383 1 1.616905
## pct_private_rented 3.300686 1 1.816779
## pct_20_24 2.465941 1 1.570332
## pct_new_migrant 5.166073 1 2.272900
## pct_overcrowded 3.632450 1 1.905899
## pct_bad_health 1.887181 1 1.373747
## log_job_density 1.780076 1 1.334195
## factor(Derived_Borough) 4.934313 31 1.026080
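The trade-off between parsimony and explanatory power described above can be checked with a nested-model comparison. The following is a minimal sketch on simulated data, not the study's dataset; the names `toy`, `group` (a stand-in for the borough factor) and `noise_var` are hypothetical, and the partial F-test is only one of several reasonable criteria.

```r
# Minimal sketch on simulated data (NOT the study's dataset): compare a full
# model with a reduced model that drops one candidate predictor.
set.seed(1)
n <- 600
group     <- factor(sample(letters[1:6], n, replace = TRUE))  # stand-in for borough
x1        <- rnorm(n)
x2        <- rnorm(n)
noise_var <- rnorm(n)   # unrelated to the outcome by construction
y <- 2 + 0.5 * x1 - 0.3 * x2 + 0.1 * as.numeric(group) + rnorm(n)
toy <- data.frame(y, x1, x2, noise_var, group)

full    <- lm(y ~ x1 + x2 + noise_var + group, data = toy)
reduced <- update(full, . ~ . - noise_var)   # drop the candidate predictor

# Partial F-test: does the dropped variable add explanatory power?
comparison <- anova(reduced, full)
print(comparison)

# Side-by-side fit statistics for the two specifications
fit_summary <- data.frame(
  model  = c("full", "reduced"),
  adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
  aic    = c(AIC(full), AIC(reduced))
)
print(fit_summary)
```

If the F-test is not significant and adjusted R2 barely moves, dropping the variable costs little explanatory power, which matches the pattern in the tables above, where R2 stays at 0.478 across the last two specifications.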
Visualizing Relative Explanatory Power
Using the predictors retained in the final model, we generated a horizontal bar chart comparing the relative explanatory power of each one. Because the predictors are measured in different units, we calculated standardized coefficients so that their magnitudes can be compared directly and the factors most strongly associated with crime rates can be identified.
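One common way to obtain standardized coefficients is to rescale each raw coefficient by sd(x) / sd(y). The following is a minimal sketch of that calculation on simulated data, not the study's dataset; the column names `pct_a`, `log_b` and `log_y` are hypothetical, and the report's own calculation may have used a different method.

```r
# Minimal sketch on simulated data (NOT the study's dataset): standardized
# coefficients for numeric predictors measured on different scales.
set.seed(7)
n <- 400
toy <- data.frame(
  pct_a = runif(n, 0, 30),     # hypothetical percentage-scale predictor
  log_b = rnorm(n, 0, 1.5)     # hypothetical log-scale predictor
)
toy$log_y <- 4 + 0.05 * toy$pct_a - 0.25 * toy$log_b + rnorm(n, sd = 0.5)

fit        <- lm(log_y ~ pct_a + log_b, data = toy)
predictors <- c("pct_a", "log_b")

# Standardized coefficient = raw coefficient * sd(x) / sd(y)
std_coef <- coef(fit)[predictors] *
  sapply(toy[predictors], sd) / sd(toy$log_y)

# Cross-check: refitting on z-scored variables gives the same slopes
toy_z <- as.data.frame(scale(toy))
fit_z <- lm(log_y ~ pct_a + log_b, data = toy_z)

print(round(std_coef, 3))
print(round(coef(fit_z)[predictors], 3))

# Order by absolute size, ready for a horizontal bar chart
ordered_coef <- std_coef[order(abs(std_coef))]
print(round(ordered_coef, 3))
```

The ordered vector can then be passed to a plotting function, for example `barplot(ordered_coef, horiz = TRUE)`. In a model with fixed effects, the same rescaling applies to the numeric predictors only; the factor dummies are left out of the chart.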
# ==============================================================================
# Display and save the marginal effect plots one at a time
# ==============================================================================
library(tidyverse)
library(ggeffects)

# 1. Make sure the data and model are set up correctly
#    (avoids the earlier factor() error)
df_clean <- df_clean %>%
  mutate(Derived_Borough = as.factor(Derived_Borough))

vars_top <- c("pct_unemployed", "pct_private_rented", "pct_20_24",
              "pct_new_migrant", "pct_overcrowded", "pct_bad_health",
              "log_job_density")

# Use the cleaned formula (no factor() call inside the formula,
# because Derived_Borough is already a factor)
formula_clean <- as.formula(
  paste("log_crime_rate ~", paste(vars_top, collapse = " + "), "+ Derived_Borough")
)
model_clean <- lm(formula_clean, data = df_clean)

# 2. Choose the three core variables to examine
vars_of_interest <- c("pct_unemployed", "pct_private_rented", "log_job_density")

# 3. Loop over the variables: generate, display, and save each plot
for (var in vars_of_interest) {
  # Compute model predictions across the range of this variable
  eff <- ggpredict(model_clean, terms = var)

  # Build the plot
  p <- plot(eff) +
    labs(
      title = paste("Marginal Effect Analysis:", var),  # plot title
      y = "Predicted Log Crime Rate",
      x = var
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      axis.title = element_text(size = 12)
    )

  # [Key step] Print each plot to the screen
  print(p)

  # [Optional step] Save each plot as a separate image file.
  # The file name is built from the variable name, e.g. "Effect_pct_unemployed.png"
  filename <- paste0("Effect_", var, ".png")
  ggsave(filename, p, width = 6, height = 5, bg = "white")
  cat("Saved plot:", filename, "\n")
}
## Saved plot: Effect_pct_unemployed.png
## Saved plot: Effect_pct_private_rented.png
## Saved plot: Effect_log_job_density.png